Dataset Management
Keeping consistent rules for how data is recorded is crucial for accurate data analysis. When many different people enter data, each person may have a different idea of how the data should be entered; at the very least this can delay analysis, and at worst it can make analysis impossible. One of the most effective ways to keep data organized in a consistent manner is to maintain a data codebook. The main components of a good data codebook are:
# Variable names and descriptions of all variables in the dataset
# Type of data in each variable (numerical, categorical, text)
# Description of numerical codes for categorical variables (e.g. 1 = White, 2 = Black, 3 = Asian)
# How missing data is indicated (NA, ., -)

Codebook Example
Having a well-defined codebook similar to this one can greatly help investigators, research assistants and data analysts. Depending on what program you are using to enter your data, this functionality may already be built in. Here are examples for REDCap and SPSS. If you are manually entering data into an Excel spreadsheet, a codebook is something you would need to develop on your own, but following a format similar to the one above can be quite useful.

Inconsistent Coding of Missing Data
One of the most common obstacles to accurate analysis is the lack of a consistent way to mark which values are missing. The example dataset below shows a common issue. Based on this data alone, there are two likely possibilities for subjects 5149 and 6047:
# Diabetes status is unknown or missing. Presumably these entries should have been marked as "N/A", but someone entering data may have left them blank because they did not know to enter N/A.
# Subjects do not have diabetes. Especially for binary (Yes/No) variables, a value is sometimes entered only if the subject has the condition, and the cell is left blank otherwise.
For the analyst, especially one who was not involved in data collection, it is impossible to know which of these options is the correct one.
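The effect of the two readings can be illustrated with a minimal Python sketch. The records are hypothetical: only the subject IDs 5149 and 6047 and their blank diabetes entries come from the text above, while the other rows, the column names and the function `diabetes_percentage` are assumptions made up for illustration.

```python
# Hypothetical records: subject IDs 5149 and 6047 (with blank diabetes
# status) come from the text; all other rows and values are invented.
records = [
    {"subject_id": 1021, "diabetes": "Yes"},
    {"subject_id": 3388, "diabetes": "No"},
    {"subject_id": 5149, "diabetes": ""},
    {"subject_id": 6047, "diabetes": ""},
    {"subject_id": 7730, "diabetes": "No"},
]


def diabetes_percentage(rows, blank_means_no):
    """Percent of subjects with diabetes under one reading of blank cells."""
    yes = 0
    counted = 0
    for row in rows:
        value = row["diabetes"].strip()
        if value == "":
            if not blank_means_no:
                continue  # blank read as missing: drop from the denominator
            value = "No"  # blank read as "No": keep in the denominator
        counted += 1
        if value == "Yes":
            yes += 1
    return 100.0 * yes / counted


pct_if_blank_is_missing = diabetes_percentage(records, blank_means_no=False)
pct_if_blank_is_no = diabetes_percentage(records, blank_means_no=True)
print(f"Blank = missing: {pct_if_blank_is_missing:.1f}%")
print(f"Blank = No:      {pct_if_blank_is_no:.1f}%")
```

With these made-up rows, the same column yields two different percentages depending only on how the blanks are read, which is why the coding rule needs to be written down before analysis.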
If an investigator wanted to know what percentage of people in the study had diabetes, more information about how the data was recorded would be needed to get an accurate answer.

Common Ways to Denote Missing Data
# . (Period) - Default character for missing data in SAS
# NA - Default for missing data in R
# 99999 or some implausibly high number - Commonly used in survey datasets to keep all entries in the column numerical. However, this should be clearly noted in a codebook; otherwise it could lead to inaccurate analysis.
# (Blank) - May be fine to use in certain situations, but can be confusing. Some may code a certain outcome as 1 = Yes; (Blank) = No, which makes it difficult to tell who has "No" for the outcome from who has an unknown value.
# 0 - Generally not recommended, especially for a numerical variable where 0 is a plausible value. Also not recommended for binary (Yes or No) variables, since the common coding is 1 = Yes, 0 = No.
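One way to apply these conventions at analysis time is to let the codebook drive the cleaning step. The following is a minimal Python sketch using only the standard library; the `CODEBOOK` layout, the variable names, the sample rows and the helper `clean_value` are all hypothetical and are not taken from any particular study or software package.

```python
import csv
import io

# Hypothetical codebook: variable names, types and missing-value codes
# are illustrative assumptions, not a real study's definitions.
CODEBOOK = {
    "age": {"type": "numerical", "missing": {"99999", "NA", ".", ""}},
    "diabetes": {
        "type": "categorical",
        "codes": {"1": "Yes", "0": "No"},
        "missing": {"NA", ".", ""},
    },
}

# Made-up raw data mixing several of the missing-value conventions.
RAW_CSV = """subject_id,age,diabetes
1,54,1
2,99999,0
3,61,
4,.,NA
"""


def clean_value(variable, raw):
    """Return None for a documented missing code, else the decoded value."""
    spec = CODEBOOK[variable]
    raw = raw.strip()
    if raw in spec["missing"]:
        return None
    if spec["type"] == "numerical":
        return float(raw)
    return spec["codes"][raw]  # raises KeyError on an undocumented code


cleaned = []
for row in csv.DictReader(io.StringIO(RAW_CSV)):
    cleaned.append({
        "subject_id": row["subject_id"],
        "age": clean_value("age", row["age"]),
        "diabetes": clean_value("diabetes", row["diabetes"]),
    })

# Summaries use only the values that are actually known.
ages = [r["age"] for r in cleaned if r["age"] is not None]
mean_age = sum(ages) / len(ages)
print(cleaned)
print(f"Mean age, excluding missing: {mean_age:.1f}")
```

In this sketch 0 is deliberately kept as a real value ("No") rather than a missing code, and the 99999 placeholder is removed before averaging; left in place, a single such placeholder would dominate the mean.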